Abstract

We provide a new perspective to understand proximal gradient algorithms. We show that both the proximal gradient algorithm (PGA) and the Bregman proximal gradient algorithm (BPGA) can be viewed as generalized proximal point algorithms (GPPA), based on which more accurate convergence rates of PGA and BPGA are obtained directly. Furthermore, based on the GPPA framework, we incorporate a back-tracking line search scheme into PGA and BPGA, and analyze the convergence rate with numerical verification.

1. Introduction

Proximal algorithms have been extensively studied for non-smooth convex optimization problems. Owing to their simplicity, these algorithms have broad applications in practical problems including image processing, e.g., [1, 2], distributed statistical learning, e.g., [?], and low rank matrix minimization, e.g., [?]. The key idea of proximal algorithms is to smooth the objective function via various smoothing techniques [?]. For example, the popular proximal point algorithm (PPA) applies a smoothing technique based on the Moreau envelope, as shown below.

Consider an optimization problem as follows:

$$\text{(P1)} \qquad \min_{x \in \bar{C}} g(x), \tag{1}$$

where $g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a proper, closed and lower semicontinuous convex function, and $C$ is a non-empty open convex set in $\mathbb{R}^n$ with its closure being $\bar{C}$. In [7], PPA was introduced, which generates a sequence $\{x_k\}$ via the following iteration step:

$$\text{(PPA)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in \bar{C}} \left\{ g(x) + \frac{1}{2\lambda_{k+1}} \|x - x_k\|_2^2 \right\}, \tag{2}$$

where $\lambda_k$ are positive numbers. It was shown in [?] that for properly chosen parameters $\lambda_k$ the sequence converges to a solution to (P1). The convergence rate of PPA was shown in [10].

Various generalized proximal point algorithms (referred to as GPPA) have been proposed and analyzed, e.g., [11, 12]. The idea is to replace the quadratic term $\frac{1}{2}\|x - x_k\|_2^2$ in PPA by $\Psi(x, x_k)$, where $\Psi$ is a metric defined on $\bar{C} \times C$ and does not necessarily take the quadratic form. As a consequence, the iteration step of GPPA reads as follows:

$$\text{(GPPA)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in \bar{C}} \left\{ g(x) + \frac{1}{\lambda_{k+1}} \Psi(x, x_k) \right\}. \tag{3}$$

The most popular choices of $\Psi$ are the Bregman distance in [14, 15] and the $\varphi$-divergence in [?]. By exploiting the structures of these distance-like metrics, the convergence rate of GPPA was established in [?]. A unified framework to analyze GPPA with various choices of distance metrics was proposed in [17].

Motivated by PPA, splitting algorithms have been designed and analyzed in [9, 18]. In particular, consider the following problem:

$$\text{(P2)} \qquad \min_{x \in \mathbb{R}^n} \{ f(x) + g(x) \}, \tag{4}$$

where $f$ is a differentiable convex function with a $1/\gamma$-Lipschitz continuous gradient, and $g$ is a proper closed convex function. The proximal gradient algorithm (PGA) has been developed in [19] to solve (P2), which has the iteration step given by

$$\text{(PGA)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ g(x) + \langle x, \nabla f(x_k) \rangle + \frac{1}{2\eta} \|x - x_k\|_2^2 \right\}, \tag{5}$$

where $\eta$ is the step size satisfying $\eta \le \gamma$. It has been shown in [1] that PGA has a convergence rate of $O(1/k)$¹, and the rate can be further improved to be $O(1/k^2)$ via techniques of acceleration. Similarly, PGA can also be generalized by replacing the quadratic term, i.e., the third term in (5), with the Bregman distance. The corresponding algorithm is referred to as BPGA [20], and is given by

$$\text{(BPGA)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ g(x) + \langle x, \nabla f(x_k) \rangle + \frac{1}{\eta} D_H(x, x_k) \right\}, \tag{6}$$

where $D_H$ is the Bregman distance (as defined in Definition 2 in Section 2) based on the convex function $H$. It has been shown in [20] that the convergence rate of BPGA is $O(1/k)$.

¹ Here, $f(n) = O(g(n))$ denotes that $|f(n)| \le \xi |g(n)|$ for all $n >$ …
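To make the PGA iteration (5) concrete, the following is a minimal sketch for the particular choice $g = \|\cdot\|_1$, for which completing the square in (5) gives the well-known soft-thresholding formula. The function names `soft_threshold`, `pga_step_l1` and `pga_objective` are illustrative and do not come from the paper; the choice of $g$ is an assumption made only for this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pga_step_l1(x_k, grad_f, eta):
    """One PGA step (5) for the assumed choice g = ||.||_1.

    Completing the square in (5) shows that the minimizer is the
    proximity operator of eta * g evaluated at x_k - eta * grad f(x_k).
    """
    return soft_threshold(x_k - eta * grad_f(x_k), eta)

def pga_objective(x, x_k, grad_f, eta):
    """Objective inside the argmin of (5) for g = ||.||_1."""
    return (np.sum(np.abs(x)) + x @ grad_f(x_k)
            + np.sum((x - x_k) ** 2) / (2.0 * eta))
```

Evaluating `pga_objective` at the output of `pga_step_l1` and at nearby points is a quick way to check that the closed form indeed minimizes the objective of (5).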

It is clear that PGA and BPGA are more general algorithms than the original PPA and GPPA (with Bregman distance), because they reduce to PPA and GPPA when $f$ is a constant function. Thus, in the existing literature, the convergence rates of PGA and BPGA are analyzed on their own, as in [1, 20], without resorting to the existing analysis of PPA and GPPA. In contrast to this conventional viewpoint of connections between PGA/BPGA and PPA/GPPA, the main contribution of this paper is to show that PGA and BPGA can, in fact, be viewed as GPPA with Bregman distance metrics, and thus both are special cases of GPPA. Consequently, the convergence rate of both algorithms can be obtained directly based on that of GPPA.

More specifically, we show that the sequences generated by PGA and BPGA are exactly the same as those generated by GPPA with Bregman distance metrics associated with properly chosen functions. This provides a new perspective that unifies PGA and BPGA into the framework of GPPA. To the best of our knowledge, such a connection has not been established or reported in the existing literature. Consequently, the convergence rates of PGA and BPGA follow directly from that of GPPA, which is a much simpler route to convergence analysis than existing approaches. Interestingly, the convergence rate obtained in this way is more accurate than that obtained by directly analyzing PGA as in [1] and BPGA as in [20]. Moreover, based on the GPPA framework, we further incorporate a back-tracking line search scheme into PGA and BPGA, which is easier to implement in practice. We also analyze the convergence rate of such algorithms with line search and verify our analysis numerically.

The rest of the paper is organized as follows. In Section 2 we introduce necessary definitions and properties that are useful for our analysis. In Section 3 we present connections of PGA and BPGA to GPPA. In Section 4 we further develop PGA and BPGA with line search. Finally, in Section 5 we conclude our paper with a few remarks on our results.


2. Preliminaries

In this section, we introduce definitions and properties that are useful for developing our results.

Definition 1 (Proximal Distance). Let $C$ be an open non-empty convex subset of $\mathbb{R}^n$. Let $\Psi : \mathbb{R}^n \times \mathbb{R}^n \to [0, +\infty]$ be a continuous function with $\operatorname{dom} \Psi(x, \cdot) = C$, $\operatorname{dom} \Psi(\cdot, y) = \bar{C}$ and $\operatorname{dom} \nabla_1 \Psi(\cdot, y) = C$ for every $x \in \bar{C}$ and $y \in C$. The function $\Psi$ is said to be a proximal distance if the following conditions hold:

(a) $\Psi(\cdot, y)$ is differentiable and strictly convex for every $y \in C$;

(b) $\Psi(\cdot, y)$ has bounded level sets, i.e., for every $\alpha \in \mathbb{R}$, the set $\{x \in \bar{C} : \Psi(x, y) \le \alpha\}$ is bounded for every $y \in C$;

(c) $\Psi(x, y) \ge 0$ for all $x \in \bar{C}$ and $y \in C$, and equality holds iff $x = y$.

Definition 2 (Bregman Distance). Let $C$ be an open non-empty convex subset of $\mathbb{R}^n$. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper with $\operatorname{dom} h = \bar{C}$. Assume $h$ is continuous and strictly convex on $\bar{C}$, and continuously differentiable on $C$ with $\operatorname{dom} \nabla h = C$. Then a distance metric $D_h$ is a Bregman distance associated with the function $h$ if it is a proximal distance and has the following structure:

$$D_h(x, y) = h(x) - h(y) - \langle x - y, \nabla h(y) \rangle, \tag{7}$$

for all $x \in \bar{C}$ and $y \in C$.

Note that (7) and the convexity of $h$ imply the non-negativity of the Bregman distance. Let $D_h$ and $D_{h'}$ be two Bregman distances. Then, for all $a, b \in C$ and $c \in \bar{C}$, (7) implies that the following two properties hold:

• Three-point property:

$$D_h(c, a) + D_h(a, b) - D_h(c, b) = \langle \nabla h(b) - \nabla h(a), c - a \rangle. \tag{8}$$

• Linearity:

$$D_h(a, b) \pm D_{h'}(a, b) = D_{h \pm h'}(a, b). \tag{9}$$

One special GPPA scheme given in (3) is to set $\Psi$ to be a Bregman distance $D_h$. In [17], the convergence rate of such GPPA with Bregman distance was established, which we present below for convenience.

Theorem 1 (Auslender and Teboulle [17], Theorem 2.1). Let $g$ be a proper, closed, lower semicontinuous convex function and let $D_h : \bar{C} \times C \to \mathbb{R}_+$ be a Bregman distance. Denote by $\{x_k\}$ the sequence generated by GPPA with Bregman distance $D_h$. Assume $\operatorname{dom} g \cap C \neq \emptyset$ and the solution set $X^*$ of (P1) is non-empty. Set $\sigma_m = \sum_{k=1}^{m} \lambda_k$ with $\sigma_0 = 0$. Then for all $m > 0$ the following convergence rate holds:

$$g(x_m) - g(x^*) \le \frac{D_h(x^*, x_0)}{\sigma_m}, \quad \forall x^* \in X^*. \tag{10}$$

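As a quick numerical sanity check of the three-point property (8) and the linearity property (9), the sketch below evaluates both sides for two illustrative choices of $h$: the negative entropy and the squared Euclidean norm on the positive orthant. These choices, the random test points, and the helper name `bregman` are assumptions of this example, not part of the original text.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <x - y, grad h(y)>, as in (7)."""
    return h(x) - h(y) - (x - y) @ grad_h(y)

# Two illustrative choices of h (assumptions of this sketch).
neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1.0
half_sq = lambda x: 0.5 * (x @ x)
grad_half_sq = lambda x: x

rng = np.random.default_rng(0)
a, b, c = (rng.uniform(0.1, 1.0, size=5) for _ in range(3))

# Three-point property (8): D_h(c,a) + D_h(a,b) - D_h(c,b) = <grad h(b) - grad h(a), c - a>.
lhs = (bregman(neg_entropy, grad_neg_entropy, c, a)
       + bregman(neg_entropy, grad_neg_entropy, a, b)
       - bregman(neg_entropy, grad_neg_entropy, c, b))
rhs = (grad_neg_entropy(b) - grad_neg_entropy(a)) @ (c - a)

# Linearity (9) with the "+" sign: D_h + D_h' = D_{h + h'}.
h_sum = lambda x: neg_entropy(x) + half_sq(x)
grad_h_sum = lambda x: grad_neg_entropy(x) + grad_half_sq(x)
lin_lhs = (bregman(neg_entropy, grad_neg_entropy, a, b)
           + bregman(half_sq, grad_half_sq, a, b))
lin_rhs = bregman(h_sum, grad_h_sum, a, b)

print(np.isclose(lhs, rhs), np.isclose(lin_lhs, lin_rhs))
```

Both identities follow algebraically from (7), so the two comparisons should agree up to floating-point rounding for any admissible points.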
3. BPGA and PGA: A Special Case of GPPA

In this section, we first analyze BPGA for problem (P2), and then specialize our result to its special case PGA. We first recall the iteration step of BPGA for convenience:

$$\text{(BPGA)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ g(x) + \langle x, \nabla f(x_k) \rangle + \frac{1}{\eta} D_H(x, x_k) \right\},$$

where $\eta$ is the step size satisfying $\eta \le \gamma$ with $1/\gamma$ being the Lipschitz constant of $\nabla f$, and $H$ is a differentiable and strongly convex function with $\operatorname{dom} H = \bar{C}$ and $\operatorname{dom} \nabla H = C$.

In the following, we reformulate BPGA as GPPA with a Bregman distance, based on which the convergence rate of BPGA can be established in a straightforward fashion.

Theorem 2. Consider problem (P2) and set $F = f + g$. Let $H$ be differentiable and strongly convex with parameter $\sigma \ge \eta/\gamma$. Assume $\operatorname{dom} F \cap C \neq \emptyset$. Then for any initial seed $x_0 \in \operatorname{dom} F \cap C$, the sequence $\{x_k\}$ generated by BPGA is the same as the one generated by the following GPPA iteration:

$$x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \{ F(x) + D_h(x, x_k) \}, \tag{11}$$

where $D_h$ is the Bregman distance based on the function $h = \frac{1}{\eta} H - f$.

Proof. We first show that $h := \frac{1}{\eta} H - f$ is convex. To this end, it suffices to show that $\nabla h$ is a monotone operator. Indeed, for any $x, y \in \operatorname{dom} f \cap C$, the strong convexity of $H$ and the Lipschitz continuity of $\nabla f$ yield

$$\langle \nabla H(x) - \nabla H(y), x - y \rangle \ge \sigma \|x - y\|_2^2,$$

$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \le \frac{1}{\gamma} \|x - y\|_2^2.$$

Since $\sigma \ge \eta/\gamma$, the above two inequalities imply that

$$\langle \nabla h(x) - \nabla h(y), x - y \rangle \ge \left( \frac{\sigma}{\eta} - \frac{1}{\gamma} \right) \|x - y\|_2^2 \ge 0.$$

Hence, $h$ is convex.

Due to the linearity (9) and the definition of Bregman distance (7), we obtain

$$\begin{aligned}
F(x) + D_{\frac{1}{\eta} H - f}(x, x_k)
&= f(x) + g(x) + \frac{1}{\eta} D_H(x, x_k) - D_f(x, x_k) \\
&= g(x) + \langle x, \nabla f(x_k) \rangle + \frac{1}{\eta} D_H(x, x_k) + f(x_k) - \langle x_k, \nabla f(x_k) \rangle.
\end{aligned}$$

The last two terms do not depend on $x$. Thus, $x_{k+1}$ generated from BPGA is identical to the one generated by (11). This completes the proof. □

Remark 1. The constraint $\eta \le \gamma$ in BPGA guarantees $h$ to be a convex function. By choosing smaller $\eta$, the function $H$ can be less strongly convex. Since $\eta$ can be chosen arbitrarily close to zero, the range of $\sigma$ in Theorem 2 can be as large as $\sigma > 0$.

Since BPGA can be formulated as a special case of GPPA, the bound for GPPA in equation (10) applies and we directly obtain the following convergence rate for BPGA.

Corollary 1. Let $F = f + g$ and $h = \frac{1}{\eta} H - f$. Assume $\operatorname{dom} F \cap C \neq \emptyset$. Then for all $x^* \in X^*$, the sequence $\{x_k\}$ generated by BPGA satisfies the following convergence rate:

$$F(x_k) - F(x^*) \le \frac{D_h(x^*, x_0)}{k}. \tag{12}$$

Proof. The proof follows by applying Theorem 2 and identifying $g$, $h$, $\lambda_k$, and $\sigma_k$ in equation (10) with $F$, $\frac{1}{\eta} H - f$, $1$, and $k$, respectively. □

We note that if $H = \frac{1}{2}\|\cdot\|_2^2$, then BPGA reduces to PGA. Hence the following corollary holds:

Corollary 2. Denote $F = f + g$ and $h = \frac{1}{2\eta}\|\cdot\|_2^2 - f$. Then for any $x^* \in X^*$, the sequence $\{x_k\}$ generated by PGA satisfies the following convergence rate:

$$F(x_k) - F(x^*) \le \frac{D_{\frac{1}{2\eta}\|\cdot\|_2^2 - f}(x^*, x_0)}{k}$$

for all $k > 0$.

We note that the convergence rate for PGA in Corollary 2 (referred to as GPPA-PGA rate) is more accurate than the one in [1] (referred to as PGA rate). We compare these two rates as follows:

$$\text{(GPPA-PGA rate)} \qquad \frac{\frac{1}{2\eta}\|x^* - x_0\|_2^2 - D_f(x^*, x_0)}{k},$$

$$\text{(PGA rate)} \qquad \frac{\frac{1}{2\eta}\|x^* - x_0\|_2^2}{k}.$$

Although both rates are of order $O(1/k)$, it is clear that the GPPA-PGA rate is more accurate, since $D_f(x^*, x_0) \ge 0$ by the convexity of $f$. We next provide an example to illustrate this fact.

Consider (P2) with $f = \frac{1}{2}\|x - b\|_2^2$ and $g = \|x\|_1$, where $b$ is a constant vector in $\mathbb{R}^n$. It is clear that the Lipschitz constant of $\nabla f$ is $1$. Hence, by choosing $\eta = \gamma$, the iteration step of PGA is given by (after further simplification):

$$x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ \|x\|_1 + \frac{1}{2}\|x - b\|_2^2 \right\}. \tag{13}$$

The right-hand side does not depend on $x_k$, which means that the algorithm converges in one step. Applying the GPPA-PGA rate with $k = 1$ and denoting $F = f + g$, we have that $F(x_1) - F(x^*) \le \frac{1}{2\eta}\|x^* - x_0\|_2^2 - D_f(x^*, x_0) = 0$, where the equality holds because $\eta = 1$ and $D_f(x^*, x_0) = \frac{1}{2}\|x^* - x_0\|_2^2$ for this quadratic $f$. We conclude that $F(x_1) = F(x^*)$, and hence the algorithm does converge in one step. The PGA rate is clearly not tight here. Therefore, this example demonstrates that we establish a more accurate convergence rate of PGA through the new perspective of PGA as GPPA with a special Bregman distance metric. Similarly, the convergence rate of BPGA given in Corollary 1 is in general tighter than the one in [20].
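The equivalence between PGA and GPPA with $h = \frac{1}{2\eta}\|\cdot\|_2^2 - f$ can also be checked numerically. The sketch below uses the example above ($f = \frac{1}{2}\|x - b\|_2^2$, $g = \|x\|_1$, $\eta = 1$) and verifies that the GPPA and PGA objectives differ only by a term independent of $x$, and that the Corollary 2 bound at $k = 1$ vanishes. The dimension, random data, and helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 4, 1.0
b = rng.normal(size=n)

f = lambda x: 0.5 * np.sum((x - b) ** 2)      # smooth part, gradient is 1-Lipschitz
grad_f = lambda x: x - b
g = lambda x: np.sum(np.abs(x))               # non-smooth part
h = lambda x: (x @ x) / (2.0 * eta) - f(x)    # h = (1/(2 eta)) ||.||^2 - f
grad_h = lambda x: x / eta - grad_f(x)

def bregman(fun, grad, x, y):
    """Bregman distance D_fun(x, y) following the structure in (7)."""
    return fun(x) - fun(y) - (x - y) @ grad(y)

def gppa_objective(x, x_k):
    """F(x) + D_h(x, x_k), the objective of the GPPA iteration (11)."""
    return f(x) + g(x) + bregman(h, grad_h, x, x_k)

def pga_objective(x, x_k):
    """g(x) + <x, grad f(x_k)> + (1/(2 eta)) ||x - x_k||^2, the objective of (5)."""
    return g(x) + x @ grad_f(x_k) + np.sum((x - x_k) ** 2) / (2.0 * eta)

x_k = rng.normal(size=n)
const = f(x_k) - x_k @ grad_f(x_k)            # term that does not depend on x
gaps = [gppa_objective(x, x_k) - pga_objective(x, x_k) - const
        for x in rng.normal(size=(5, n))]

# One-step example: the minimizer of F is the soft-thresholding of b,
# and the Corollary 2 bound D_h(x*, x_0) / k at k = 1 is zero up to rounding.
x_star = np.sign(b) * np.maximum(np.abs(b) - 1.0, 0.0)
x_0 = rng.normal(size=n)
bound = bregman(h, grad_h, x_star, x_0)
```

Here `gaps` should be numerically zero for every test point, and `bound` should vanish because $h$ is affine when $\eta = 1$ for this particular $f$.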

4. BPGA with Line Search

It can be observed that the implementation of BPGA requires knowledge of the Lipschitz parameter $\gamma$ in advance. If $\gamma$ is unknown a priori, we incorporate the idea of line search into BPGA, and adaptively set an estimate of the Lipschitz constant at each iteration step. More specifically, iteration step $k$ of BPGA with line search is given as follows:

• Set $x_{k+1} = \operatorname*{argmin}_{x} \{ F(x) + D_{h_k}(x, x_k) \}$ with $h_k = \frac{1}{\eta_k} H - f$;

• If $D_{h_k}(x_{k+1}, x_k) \ge 0$, set $\eta_{k+1} = \eta_k$ and go to iteration step $k+1$; otherwise, set $\eta_k \leftarrow \alpha \eta_k$, where $\alpha < 1$, and repeat iteration step $k$.

We note that the line search criterion $D_{h_k}(x_{k+1}, x_k) \ge 0$ can be intuitively understood as follows: at iteration $k$, we search for a $D_{h_k}$ that behaves as a distance between $x_{k+1}$ and $x_k$.
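The two steps above can be sketched in code for the PGA special case $H = \frac{1}{2}\|\cdot\|_2^2$. The example assumes $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $g = \|\cdot\|_1$, so that the argmin has a closed form and $D_f(x, y) = \frac{1}{2}\|A(x - y)\|_2^2$; these modeling choices, the default parameters, and the function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pga_line_search(A, b, x0, eta0=100.0, alpha=0.5, num_iters=50):
    """PGA with the line search criterion D_{h_k}(x_{k+1}, x_k) >= 0.

    Sketch for F(x) = 0.5 * ||Ax - b||^2 + ||x||_1 and H = 0.5 * ||.||^2,
    so that h_k = (1 / (2 eta_k)) ||.||^2 - f.
    """
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    x, eta = x0.copy(), eta0
    history = [f(x) + np.sum(np.abs(x))]
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        while True:
            x_new = soft_threshold(x - eta * grad, eta)
            d = x_new - x
            # D_{h_k}(x_new, x) = ||d||^2 / (2 eta) - D_f(x_new, x),
            # and D_f(x_new, x) = 0.5 * ||A d||^2 for this quadratic f.
            breg = d @ d / (2.0 * eta) - 0.5 * np.sum((A @ d) ** 2)
            if breg >= 0.0:
                break          # accept the step and keep eta for the next iteration
            eta *= alpha       # shrink the step size and repeat iteration k
        x = x_new
        history.append(f(x) + np.sum(np.abs(x)))
    return x, history
```

Because each accepted step satisfies the criterion, the recorded objective values in `history` should be non-increasing up to rounding, which offers a simple check of the descent property.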

The convergence rate of BPGA with line search is characterized in the following theorem.

Theorem 3. Let $F = f + g$ and $h_k = \frac{1}{\eta_k} H - f$. Assume $\operatorname{dom} F \cap C \neq \emptyset$. Then for all $x^* \in X^*$, the sequence $\{x_k\}$ generated by BPGA with line search as described above satisfies the following convergence rate:

$$F(x_{m+1}) - F(x^*) \le \frac{D_H(x^*, x_0)}{\alpha \gamma (m+1)}. \tag{14}$$

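Proof. The idea of the proof follows that in [1]. We provide the proof here for completeness. We first note that the backtracking line search scheme guarantees

$$\alpha \gamma \le \eta_k,$$

because the step size is shrunk only when the criterion fails, and the criterion cannot fail once $\eta_k \le \gamma$.

At iteration $k$, the argmin operation implies

$$F(x_{k+1}) + D_{h_k}(x_{k+1}, x_k) \le F(x_k),$$

which combined with the line search criterion guarantees $F(x_{k+1}) \le F(x_k)$, i.e., the method is a descent method. Let $g'_{k+1} \in \partial g(x_{k+1})$ and $F'_{k+1} = \nabla f(x_{k+1}) + g'_{k+1} \in \partial F(x_{k+1})$. Then, by the convexity of $f$ and $g$, for any $x^* \in X^*$ we have

$$\begin{aligned}
F(x^*) &\ge f(x_k) + \langle x^* - x_k, \nabla f(x_k) \rangle + g(x_{k+1}) + \langle x^* - x_{k+1}, g'_{k+1} \rangle \\
&= F(x_{k+1}) + \langle x^* - x_{k+1}, F'_{k+1} \rangle + D_f(x^*, x_{k+1}) - D_f(x^*, x_k).
\end{aligned} \tag{15}$$

Furthermore, by the optimality condition for $x_{k+1}$, we can choose $F'_{k+1}$ as

$$F'_{k+1} = \nabla h_k(x_k) - \nabla h_k(x_{k+1}) \in \partial F(x_{k+1}).$$

Substituting the above expression into (15), applying the three-point property (8) with $c = x^*$, $a = x_{k+1}$, $b = x_k$, using the linearity $D_{h_k} + D_f = \frac{1}{\eta_k} D_H$, and dropping the non-negative term $D_{h_k}(x_{k+1}, x_k)$, we obtain

$$\begin{aligned}
F(x^*) - F(x_{k+1}) &\ge \frac{1}{\eta_k} \left[ D_H(x^*, x_{k+1}) - D_H(x^*, x_k) \right] \\
&\ge \frac{1}{\alpha \gamma} \left[ D_H(x^*, x_{k+1}) - D_H(x^*, x_k) \right].
\end{aligned}$$

The second inequality holds because the bracket is non-positive (the left-hand side satisfies $F(x^*) - F(x_{k+1}) \le 0$) and $\frac{1}{\eta_k} \le \frac{1}{\alpha \gamma}$. Taking the sum of the above inequality over $k = 0, \ldots, m$ and applying the fact that $F(x_{k+1}) \le F(x_k)$, then with certain simplification we obtain the desired result (14). □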

Remark 2. For the PGA with $F = f + g$ and $h_k = \frac{1}{2\eta_k}\|\cdot\|_2^2 - f$, the line search condition $D_{h_k}(x_{k+1}, x_k) \ge 0$ reduces to the back-tracking line search in [1].
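To see the reduction stated in Remark 2 explicitly, one can expand the criterion using the linearity property (9) and the definition (7). The display below is a worked restatement under the remark's assumption $H = \frac{1}{2}\|\cdot\|_2^2$; the resulting inequality is the familiar quadratic upper-bound test used in backtracking for proximal gradient methods.

```latex
\begin{align*}
D_{h_k}(x_{k+1}, x_k)
  &= \frac{1}{2\eta_k}\|x_{k+1} - x_k\|_2^2 - D_f(x_{k+1}, x_k) \ge 0 \\
\iff\quad f(x_{k+1})
  &\le f(x_k) + \langle x_{k+1} - x_k, \nabla f(x_k) \rangle
     + \frac{1}{2\eta_k}\|x_{k+1} - x_k\|_2^2 .
\end{align*}
```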

We next verify the convergence rate of BPGA with line search via a numerical experiment. Consider the following problem:

$$\operatorname*{argmin}_{x \in \mathbb{R}^n} f(x) = \frac{1}{2}\|Ax - b\|_2^2, \quad \text{s.t. } x \in \Delta,$$

where $\Delta := \{ x \in \mathbb{R}^{1000} : \sum_{j=1}^{1000} x_j = 1,\ x_j \ge 0 \}$ is the simplex constraint set. We generate one realization of $A \in \mathbb{R}^{500 \times 1000}$ and $b \in \mathbb{R}^{500}$ from the normal distribution, then normalize $b$ and the columns of $A$ to unit length. We solve the problem by two BPGA formulations. In the first formulation, we set $h_k(x) = \frac{1}{2\eta_k}\|x\|_2^2 - f(x)$, and set $g$ to be the indicator function of the simplex set. In this case, BPGA reduces to PGA. In the second formulation, we set $h_k(x) = \frac{1}{\eta_k} \sum_{i=1}^{n} x_i \ln x_i - f(x)$, and set $g$ to be the indicator function of the simplex set. In this case, BPGA corresponds to the mirror descent method [21]. We apply the line search criterion $D_{h_k}(x_{k+1}, x_k) \ge 0$ to both formulations and set $\eta_0 = 100$ with the decay rate $\alpha = 0.5$. We compare both methods with line search to the corresponding methods with the constant stepsize $\eta = 1/\lambda_{\max}(A^T A)$. We initialize the algorithms with $x_0 = 10^{-3} \mathbf{1}_{1000}$, where $\mathbf{1}_{1000}$ denotes the 1000-dimensional vector with all entries being one.
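The second formulation can be sketched as follows. With the entropy function $H(x) = \sum_i x_i \ln x_i$ on the simplex, the BPGA step has the standard closed-form multiplicative update, and $D_H$ is the Kullback-Leibler divergence. The function name, the defaults, and the `eta_min` safeguard against floating-point stalls are assumptions of this sketch and are not taken from the paper; the factor $\frac{1}{2}$ in $f$ is assumed as well.

```python
import numpy as np

def mirror_descent_line_search(A, b, x0, eta0=100.0, alpha=0.5,
                               num_iters=100, eta_min=1e-12):
    """Entropic BPGA (mirror descent) on the simplex with the criterion
    D_{h_k}(x_{k+1}, x_k) >= 0, for f(x) = 0.5 * ||Ax - b||^2."""
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    x, eta = x0.copy(), eta0
    history = [f(x)]
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        while True:
            # Closed-form BPGA step for H(x) = sum_i x_i ln x_i on the simplex;
            # subtracting grad.min() avoids overflow and cancels after normalization.
            w = x * np.exp(-eta * (grad - grad.min()))
            x_new = w / w.sum()
            d = x_new - x
            # On the simplex D_H(x_new, x) is the KL divergence; 0 * ln 0 is treated as 0.
            pos = x_new > 0
            kl = np.sum(x_new[pos] * np.log(x_new[pos] / x[pos]))
            # D_{h_k} = (1 / eta) * D_H - D_f, with D_f(x_new, x) = 0.5 * ||A d||^2.
            breg = kl / eta - 0.5 * np.sum((A @ d) ** 2)
            if breg >= 0.0 or eta <= eta_min:   # eta_min is a numerical safeguard only
                break
            eta *= alpha
        x = x_new
        history.append(f(x))
    return x, history
```

A small instance can be run by drawing `A` and `b` from a normal distribution, normalizing them as described above, and starting from the uniform vector on the simplex; the iterates should remain on the simplex and the objective values should not increase.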

Figure 1 demonstrates the convergence behavior of the four algorithms as the number of iterations changes. Four curves are plotted in this figure. The curve marked by “o” is generated by PGA with constant stepsize, and the other constant-stepsize curve is generated by the mirror descent method. The curves marked by “+” and “×”, which mostly coincide with each other, are generated by PGA and the mirror descent method with line search, respectively. It can be observed that the line search scheme achieves faster convergence compared to the corresponding algorithm with constant stepsize. This is because a larger stepsize is used at each iteration of the line search algorithms. In particular, line search significantly improves the convergence rate of the mirror descent algorithm.

Figure 1: Comparison of PGA and mirror descent with line search to those with constant stepsize.

5. Conclusion 

In this paper, we provided a new perspective on PGA and BPGA, and showed that both algorithms can be viewed as GPPA with special choices of Bregman distance. As a consequence, a more accurate convergence rate was established in a straightforward way for both PGA and BPGA. This new perspective sheds light on the essence of PGA: by properly choosing the Bregman distance in the GPPA scheme, the smooth part $f$ is linearized, and hence each iteration reduces to evaluating the proximity operator of $g$.